Red Flag/Blue Flag visualization of a common CNN for text classification

Abstract A shallow convolutional neural network (CNN), TextCNN, has become nearly ubiquitous for classification of clinical and medical text. This research presents novel eXplainable-AI (X-AI) software, Red Flag/Blue Flag (RFBF), designed for binary classification with TextCNN. RFBF visualizes each convolutional filter’s discriminative capability. This is more informative than direct assessment of logit contribution because, on smaller datasets, features that overfit to train set nuances may indiscriminately activate large logits on validation samples from both classes. RFBF enables model diagnosis, term feature verification, and overfit prevention. We present 3 use cases: (1) filter consistency assessment; (2) predictive performance improvement; and (3) estimation of information leakage between train and holdout sets. The use cases derive from experiments on TextCNN for binary prediction of surgical misadventure outcomes from physician-authored operative notes. Due to TextCNN’s prevalence, this X-AI can benefit clinical text research, and hence improve patient outcomes.


INTRODUCTION
Red Flag/Blue Flag (RFBF) is eXplainable-AI (X-AI) software that visualizes a convolutional neural network (CNN) architecture, TextCNN, 1 which is the "standard baseline for new text classification architectures." 2 RFBF is written in Python 3.10 and supports a PyTorch 3 implementation of TextCNN. 1 Despite the model's simplicity, recent research indicates strong classification performance, 2,[4][5][6][7][8] with results superior to most deep learning models 2,8 and similar to memory-intensive transformers. 2,8,9 Lu et al 8 show that TextCNN 1 outperforms 5 other common network models, including BERT, 9 on 16 binary classification tasks from discharge notes, with one-tenth the training time of BERT.
This research is influenced by 3 X-AI techniques 10-12 developed for TextCNN. 1
1. Both this research and Jacovi et al 11 present model interpretability through display of a filter activation grid, a textual analogue to the Zeiler-Fergus (ZF) 13 display. However, RFBF displays differences in filter logit output between outcomes. A comparison of these interclass logit differences across datasets provides additional insight into model behavior.
2. Cheng et al 12 developed a sample interpretability technique to calculate each token's contribution to a prediction. The derivation served as a basis for this research.
3. Zhao et al 10 present phrase-level SHAP 14 (Shapley Additive exPlanations) features for both model and sample interpretability. The authors rank phrases with methods that account for frequency and SHAP magnitude. Our X-AI is complementary: RFBF focuses on model filters instead of phrases, and on the interclass difference in feature contribution to logit outputs instead of SHAP values.
In summary, a novel aspect to RFBF is that filter concepts are presented through interclass logit differences, as opposed to solely logit magnitudes.

TextCNN architecture
Previous studies explain the TextCNN architecture 1 (Figure 1) in detail. [10][11][12] We highlight the following layers:
1. Convolution layer: convolution with k filters, $\mathbb{R}^{n \times e} \to \mathbb{R}^{n \times k}$, which are split across different ngram sub-layers. The subsequent max pool scans the text input for the ngram that most closely resembles the filter and outputs 1 activation per filter, $\mathbb{R}^{n \times k} \to \mathbb{R}^{k}$. A moniker for TextCNN 1 could be "scan-CNN."
2. Fully connected (FC) layer: outputs 2 logits for binary prediction, $\mathbb{R}^{k} \to \mathbb{R}^{2}$.
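The scan behavior of the convolution and max pool can be illustrated with a toy, pure-Python sketch. It is not the PyTorch implementation that RFBF supports: the function name conv_max_pool, the tokens, the embeddings, and the filter weights are all invented for illustration. The sketch also omits padding, so it scores n - ngram + 1 windows rather than n positions.

```python
# Toy illustration only: all names, dimensions, and values are assumptions.

def conv_max_pool(embeddings, filt, ngram):
    """Slide one filter over every ngram window and keep the best match.

    embeddings: list of n token vectors, each of length e
    filt:       flattened filter weights of length ngram * e
    Returns (max activation, start index of the winning ngram).
    """
    best_act, best_j = float("-inf"), -1
    for j in range(len(embeddings) - ngram + 1):
        # Flatten the window of `ngram` consecutive token embeddings.
        window = [v for tok in embeddings[j:j + ngram] for v in tok]
        act = sum(x * w for x, w in zip(window, filt))
        if act > best_act:
            best_act, best_j = act, j
    return best_act, best_j


# n = 4 tokens, e = 2 embedding dimensions (hypothetical values).
tokens = ["removal", "of", "infected", "hardware"]
embeddings = [[0.1, 0.0], [0.0, 0.1], [0.9, 0.2], [0.8, 0.7]]

# One bigram filter (ngram = 2), so it holds 2 * e = 4 weights.
bigram_filter = [1.0, 0.0, 1.0, 1.0]

activation, j = conv_max_pool(embeddings, bigram_filter, ngram=2)
winning_ngram = tokens[j:j + 2]
print(winning_ngram, round(activation, 2))
```

Only the winning window reaches the FC layer, which is why each filter can be summarized by the ngrams that activate it most strongly.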

TextCNN architecture modifications
For binary classification, the FC layer of TextCNN 1 can be modified to output a single logit, $\mathbb{R}^{k} \to \mathbb{R}$, instead of 2 (Figure 1).
The modified architecture has equivalent capacity: a function that is modeled as 2 logits passed to a softmax can be represented with a single logit and its negation passed to a softmax, where the single logit is half the difference between the 2 original logits.
The modification results in a bijective relationship between the K fully connected weights and K convolutional filters, which facilitates visualization since each filter contributes to one class. In contrast, each original TextCNN 1 filter contributes to both classes. However, the 2-d FC weights of already-trained TextCNN 1 models can be converted to 1-d weights as half-differences, with the logit output and its negation passed to softmax.
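The equivalence between the 2-logit and single-logit forms can be checked numerically. The sketch below is a worked example under assumed values: the pooled activations, the FC weights, and the biases are hypothetical, and the variable names are not from the RFBF source.

```python
import math


def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


# Hypothetical pooled activations for K = 3 filters (one sample).
acts = [0.8, 0.1, 1.5]

# Hypothetical trained 2-output FC layer: one weight row and bias per class.
w_control, b_control = [0.4, -0.2, 0.1], 0.05
w_patient, b_patient = [-0.3, 0.6, 0.9], -0.10

logit_control = sum(a * w for a, w in zip(acts, w_control)) + b_control
logit_patient = sum(a * w for a, w in zip(acts, w_patient)) + b_patient

# Collapse to one weight per filter: half the difference of the two rows.
w_single = [(wp - wc) / 2 for wp, wc in zip(w_patient, w_control)]
b_single = (b_patient - b_control) / 2
z = sum(a * w for a, w in zip(acts, w_single)) + b_single

p_two = softmax2(logit_control, logit_patient)  # original 2-logit model
p_one = softmax2(-z, z)                         # single logit and its negation

print(p_two, p_one)
```

The two probability pairs agree because softmax depends only on the difference between its inputs, and z - (-z) equals the original logit difference.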
Each filter outputs a logit, $l_k$. The filter logits are summed to output a sample logit.
The CNN is therefore an ensemble summation of K classifiers, where each filter is a classifier. The bias does not affect interclass difference metrics such as Area Under Curve (AUC).
Only the ngram section at location j that passes the max pool for filter k, $x^{k}_{j:j+n}$, contributes to the filter's logit, $l_k$. This activation is calculated via dot product, $x^{k}_{j:j+n} \cdot w_k$, which is scaled by $FC_k$.
The 4 multiplicative components of $l_k$:
• Cosine similarity, $\cos\theta_{e \to u}$, where $\theta_{e \to u}$ is the angle between the ngram embedding, $e^{k}_{j:j+n}$, and convolution filter, $u_k$.
• Embedding magnitude, $\lVert x^{k}_{j:j+n} \rVert$
• Sample-independent filter importance, $\mathrm{Imp}_k = \lVert w_k \rVert \times |FC_k|$
• Class membership: $\mathrm{sign}(FC_k)$
To reduce clutter, the convolution kernel biases are not notated. These biases do not affect interclass performance and can theoretically be summed into the FC bias after multiplication with $FC_k$. However, they are retained to facilitate training.
In this data, the max pool excludes negative activations, which would otherwise flip the class assigned by $FC_k$. A pre-FC ReLU can exclude them in other datasets.
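The 4-component factorization of a filter logit follows from the identity that a dot product equals the product of the two vector norms and the cosine of the angle between them. The sketch below checks this on hypothetical numbers. The vectors, the FC weight, and the variable names are assumptions for illustration, and biases are omitted as in the notation above.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


# Hypothetical values for one filter k and the ngram that passed its max pool.
x = [0.9, 0.2, 0.8, 0.7]   # flattened ngram embedding
w = [1.0, 0.0, 1.0, 1.0]   # flattened convolution filter weights
fc = -0.5                  # single FC weight for this filter

dot = sum(a * b for a, b in zip(x, w))

# Direct computation: activation scaled by the FC weight.
logit_direct = dot * fc

# The 4 multiplicative components.
cos_sim = dot / (norm(x) * norm(w))   # cosine similarity
emb_mag = norm(x)                     # embedding magnitude
importance = norm(w) * abs(fc)        # sample-independent filter importance
sign = math.copysign(1.0, fc)         # class membership

logit_factored = cos_sim * emb_mag * importance * sign

print(logit_direct, logit_factored)
```

The filter importance and the sign are fixed after training, so differences between samples come only from the cosine similarity and the embedding magnitude.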

Logit deltas
Discriminative capability depends on the distributional difference (delta) in logit outputs between outcome classes. For small or noisy datasets, direct interpretation of filter logits on unseen data, without consideration of discriminative capability, may emphasize overfit features associated with large logits from both classes.
To reflect discriminative capability, RFBF displays each filter's (1) AUC; and (2) difference in the median/mean logit value between patient and control sets (patient logit minus control logit).
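To make the two metrics concrete, the following sketch computes a per-filter AUC and logit delta from lists of filter logits. It is a minimal illustration rather than the RFBF implementation: the function names and the logit values are hypothetical, and the AUC is computed by direct pairwise comparison, which is only practical for small samples.

```python
from statistics import mean, median


def auc(patient, control):
    """Probability that a random patient logit exceeds a random control logit.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = 0.0
    for p in patient:
        for c in control:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(patient) * len(control))


def logit_delta(patient, control, agg=median):
    """Patient logit minus control logit, aggregated by median or mean."""
    return agg(patient) - agg(control)


# Hypothetical filter A: large logits in BOTH classes (overfit pattern).
overfit_patient = [2.0, 2.2, 1.8, 2.1]
overfit_control = [2.1, 1.9, 2.2, 2.0]

# Hypothetical filter B: smaller logits, but well separated by class.
useful_patient = [0.6, 0.8, 0.7, 0.9]
useful_control = [0.1, 0.3, 0.2, 0.0]

for name, pat, con in [("A", overfit_patient, overfit_control),
                       ("B", useful_patient, useful_control)]:
    print(name, auc(pat, con), logit_delta(pat, con), logit_delta(pat, con, mean))
```

In this toy data, a ranking by logit magnitude would place filter A first, whereas AUC and logit delta both favor filter B.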

Visualization
RFBF outputs an HTML table. Each table row corresponds to a filter, and the 2 columns correspond to data subsets, e.g., train and validate. Each table cell portrays the top 4 activated ngrams for the associated filter (row) and dataset (column), and the following information for each ngram (Figure 2):
• The number of ngram instances in the dataset as a patient (red) and control (blue)
• Logit
• Cosine similarity between embedding and filter
• Embedding magnitude
Each cell heading portrays ngram-independent information:
• The filter's index, k, in the classification layer
• The filter rank according to sort options
• AUC
• Logit delta and calculation inputs
  • Patient logit: the median or mean (user specified) logit in the patient set, in red
  • Control logit: analogous to patient logit, in blue
  • Delta: patient logit minus control logit
• $\mathrm{Imp}_k = \lVert w_k \rVert \times |FC_k|$, red for patient ($\mathrm{sign}(FC_k) = 1$), otherwise blue
RFBF provides options to rank the filters via discriminative performance (mean/median logit delta, AUC) on the first or second dataset, or by filter importance, $\mathrm{Imp}_k$.
The RFBF user interface follows a 2-step process. (1) Invoke the function calc_zf_dict on the CNN model and the train, validate, and holdout datasets to obtain 3 ZF 13 dictionaries. The keys are filter indices in the classification layer. For example, our classification layer has 144 weights, hence the indices range from 0 to 143. (2) Input any 2 of the 3 ZF objects to the make_zfs_table function.
The 2-step design enables the user to filter, graph, and perform exploratory analyses of the ZF dictionaries.

Data
We experiment with prediction of surgical misadventure from physician-authored operative notes obtained across diverse surgical types at the Medical University of South Carolina between May 2015 and July 2019. The control set is composed of 6800 notes from 5135 controls. The patient dataset is composed of 2 subsets: (1a) WM outcome-13 254 notes from 7628 patients who had an ICD-10 code for "adverse event without mention at time of operation" (Y83-Y84.9); and (1b) D&I outcome-840 notes from 537 patients who had an ICD-10 code for device & instrument (Y62.0-Y82.9) surgical misadventures. The data are not limited to an operation type. The control group is matched on patient demographics (Table 1).

Training pipeline
The CNN training pipeline is as follows:
1. Clean data with regular expressions and other string operations.
a. The code can be found in the "cleanTxt" function of the text_utils.py file. 15 The operations include removal of special characters, conversion to ASCII, etc.
2. Stratify data by patient, except for intentional overlap during the third use case to estimate information leakage.
3. Run 1-Cycle policy 15,16
a. 10 iterations
b. In each iteration, split the data into 75% (train), 15% (validate), and 10% (holdout) sets.
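The patient-level split in steps 2 and 3b can be sketched as follows. This is an illustration under assumptions, not the code from the RFBF repository. The function name split_by_patient and the (patient_id, text) note format are hypothetical, "stratify by patient" is interpreted as assigning all notes of a patient to the same set, and the 75/15/10 fractions are applied to patients, so the proportions of notes are only approximate.

```python
import random


def split_by_patient(notes, fractions=(0.75, 0.15, 0.10), seed=0):
    """Assign every note of a patient to exactly one of the three sets.

    notes: list of (patient_id, text) tuples (assumed format)
    Returns a dict with 'train', 'validate', and 'holdout' lists of notes.
    """
    patient_ids = sorted({pid for pid, _ in notes})
    random.Random(seed).shuffle(patient_ids)

    n_train = round(len(patient_ids) * fractions[0])
    n_validate = round(len(patient_ids) * fractions[1])

    assignment = {}
    for i, pid in enumerate(patient_ids):
        if i < n_train:
            assignment[pid] = "train"
        elif i < n_train + n_validate:
            assignment[pid] = "validate"
        else:
            assignment[pid] = "holdout"  # remainder goes to holdout

    splits = {"train": [], "validate": [], "holdout": []}
    for pid, text in notes:
        splits[assignment[pid]].append((pid, text))
    return splits


# Hypothetical data: 50 notes from 20 patients.
notes = [(f"p{i % 20}", f"note {i}") for i in range(50)]
splits = split_by_patient(notes)
patients = {name: {pid for pid, _ in rows} for name, rows in splits.items()}
print({name: len(ids) for name, ids in patients.items()})
```

A different seed per iteration would give the 10 resampled splits of step 3. Intentional train-holdout overlap for the third use case would require a separate assignment step that is not shown here.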

Use cases
We present 3 use cases: (1) filter consistency; (2) performance improvement; and (3) estimation of train-holdout information leakage. The steps for the third use case are:
a. Identify the 5-gram filter that extracts patient name, in the form of "patient name . . ."
b. For each sample, extract patient name from the ngram at the identified filter's max pool location.
c. Calculate patient name overlap between train and validate.
d. Apply a-c. in simulations that increase overlap between train and holdout sets (0%, 10%, 30%, 50%).
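The overlap calculation in step c can be sketched as below, assuming the names have already been extracted in step b. The function names, the normalization rule, and the placeholder names are hypothetical. Defining overlap as the fraction of samples in the second set whose name also occurs in the train set is one possible choice and is not confirmed by the source.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def name_overlap(train_names, other_names):
    """Fraction of samples in `other_names` whose name also occurs in train."""
    if not other_names:
        return 0.0
    train_set = {normalize(n) for n in train_names}
    hits = sum(1 for n in other_names if normalize(n) in train_set)
    return hits / len(other_names)


# Placeholder names standing in for strings extracted from the max pool ngram.
train_names = ["Name A", "name b", "Name C"]
validate_names = ["name  a ", "Name D"]

overlap = name_overlap(train_names, validate_names)
print(overlap)
```

An estimate near zero is consistent with a clean patient-level split, whereas a clearly positive value suggests that samples from the same patient appear in both sets.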

Use case 1: Filter consistency
The top 3 filters with the greatest mean logit difference on the train set for WM classification focus on patient names. The highest-ranked filters for the holdout set extract history, post-operative diagnosis, revision, infection, and hardware (Figure 2). For D&I, different sort arrangements result in more consistent ranks with a focus on hardware removal, infection, and specimen distribution. However, when ranked by train set AUC, the first D&I filter extracts Electronic Health Record (EHR) template features (Supplementary Figure S1).

Use case 2: Performance improvement
A plot of the median logit deltas between the train and validate sets for experiment 7 (highest AUC, 0.839), D&I, reveals filter 1 as an outlier, where "1" references the filter index in the ZF dictionary and classification layer. RFBF indicates distribution drift for the filter. For example, "mandible" is nearly control set exclusive in the train set, but it is slightly associated with patients in the validate set (Figure 3). Removal of filter 1 led to an improvement of holdout AUC from 0.839 to 0.843.

Classification performance
TextCNN surpasses chance on these challenging sets. Device & Instrument performance surpassed Without Mention. This may be because WM operative notes do not record the safety incident at time of operation.

Use case 1: Filter consistency
RFBF indicates that many of the most activating features are overfit train features, such as template artifacts and patient names. However, ranks by validate/holdout performance reveal relevant features.
Patient name-extraction filters rank highest on the train set for WM, hence X-AI techniques that rank by logit magnitude would emphasize patient name features. In comparison, a filter sort via discriminative performance on the validate set reveals informative features: revision, loose/painful hardware, and infection.

Use case 2: Performance improvement
For D&I classification, the following 3-step process led to a holdout AUC improvement from 0.839 to 0.843. (1) Identify an outlier filter on a plot of validate-train logit deltas (Figure 3A); (2) confirm a train-validate distribution shift in RFBF; and (3) remove the filter. While the improvement is slight, it can affect model ranking experiments. For example, this improvement is larger than the gap between second-place TextCNN and the first-place 1-layer encoder in a recent study 8 that compares neural network models for medical text classification.
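Steps 1 and 3 of this process can be approximated in code. In the study the outlier was identified visually on a plot, so the numeric rule below (a robust score based on the median absolute deviation) is a stand-in assumption. Likewise, filter removal is implemented here by zeroing the filter's FC weight, which is one possible approach that the source does not confirm. All names and values are hypothetical.

```python
from statistics import median


def find_outlier_filters(train_delta, validate_delta, threshold=5.0):
    """Flag filters whose validate-minus-train logit delta is atypical.

    Uses |gap - median| / MAD as a robust outlier score (assumed rule).
    """
    gaps = {k: validate_delta[k] - train_delta[k] for k in sorted(train_delta)}
    center = median(gaps.values())
    mad = median(abs(g - center) for g in gaps.values())
    if mad == 0:
        return []
    return [k for k, g in gaps.items() if abs(g - center) / mad > threshold]


def remove_filters(fc_weights, filter_ids):
    """Return a copy of the FC weights with the selected filters zeroed."""
    drop = set(filter_ids)
    return [0.0 if k in drop else w for k, w in enumerate(fc_weights)]


# Hypothetical median logit deltas per filter index on each dataset.
train_delta = {0: 0.30, 1: 0.90, 2: 0.25, 3: 0.40,
               4: 0.35, 5: 0.20, 6: 0.30, 7: 0.28}
validate_delta = {0: 0.25, 1: -0.20, 2: 0.22, 3: 0.33,
                  4: 0.30, 5: 0.18, 6: 0.27, 7: 0.24}

outliers = find_outlier_filters(train_delta, validate_delta)

# Hypothetical single-logit FC weights, one per filter.
fc_weights = [0.5, -0.8, 0.3, 0.6, -0.2, 0.4, 0.1, -0.3]
pruned = remove_filters(fc_weights, outliers)
print(outliers, pruned)
```

A flagged filter is only a candidate: step 2, inspection of its top ngrams in RFBF to confirm a train-validate distribution shift, remains a manual check before removal.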
A filter that overfits a concept may still provide a classification benefit if (1) it learns more than one concept; and (2) the proportions of the overfit feature do not favor one class over another on unseen data.
To prevent overfitting, the holdout set should only be analyzed at the end of the experiment.

Use case 3: Estimation of Train-Holdout information leakage through feature extraction
RFBF can serve as a starting point to identify features that inflate performance. For example, a data loader bug may load samples from the same patient in the train and validate/holdout sets. Such features are readily visible in RFBF because they tend to have high train set logit deltas. We present strong results for overfit estimation via patient-name extraction.
While patient names are ideally masked, the results show that names serve as informative data points to detect information leakage. We therefore propose a workflow that first uses RFBF to check for information leakage via patient name, and the researcher then continues analysis with preprocessed name masking if there is no evidence of leakage.

Source code
The RFBF source code is on GitHub. 15 The surgical datasets are not public because they contain Protected/Personal Health Information (PHI). Instead, the public sample is a set of 6726 PubMed abstracts (2268 positives, 4458 controls). The outcome variable is presence of a MeSH term for Hepatitis. The model achieved an AUC of 0.966.

Limitations and future research
Future research can improve upon this research.

AUTHOR CONTRIBUTIONS
The main author, JDG, developed the software. JSO mentored the implementation of TextCNN and acquisition of datasets. AVA supervised the technical derivations. KRC provided feedback regarding surgical safety expertise.

SUPPLEMENTARY MATERIAL
Supplementary material is available at JAMIA Open online.

CONFLICT OF INTEREST STATEMENT
None declared.

DATA AVAILABILITY
The referenced surgical notes dataset is PHI-sensitive and not publicly available. The Red Flag/Blue Flag GitHub page contains a corpus of 6726 PubMed case reports (2268 positives with MeSH term for Hepatitis C, 4458 controls). The dataset is composed of case reports from 5 MeSH-term based queries pulled from the PubMed API: (1) "Cirrhosis," (2) "Hepatitis C," (3) "Hepatitis," (4) "Nonalcoholic Fatty Liver Disease," and (5) a set of queries with no specified MeSH term. All queries were restricted to the years 2011 to 2018.